Introduction

The objective of this project is to perform an analysis of gun violence in the US over the past decade. The statistics aim to give a good overview of the situation in each state and to uncover trends and correlations between several factors and gun violence.

We have chosen to employ both Python and R as programming languages, the former being used for data processing and modelling purposes, and the latter for the actual visualization of the data gathered.

Three data sets have been used: the primary one, which contains the topic-specific information, and two supporting sets that contain the population of each state according to the 2019 census and the state codes.

The data source for the project is Kaggle.

Settings

R specific libraries import

library(hrbrthemes)
library(reticulate)
library(ggplot2)
library(plotly)
library(tidyverse)
library(gganimate)
library(gifski)
library(viridis)
library(geojsonio)
library(geojsonR)
library(broom)
library(geojsonsf)
library(wordcloud2)
library(RColorBrewer)
library(dplyr)
library(ggthemes)

Python specific libraries import

import pandas as pd
import numpy as np
import datetime as dt

Data extraction from the CSV files

fileName = 'gun-violence.csv'
fullsetDF = pd.read_csv(fileName)
fileName = 'us_population_by_state.csv'
populationDF = pd.read_csv(fileName)
fileName = 'state_codes.csv'
stateCodesDF = pd.read_csv(fileName)

Data Cleaning

Remove unnecessary data

The fields “incident_id” and “incident_url_fields_missing” are not needed for our analysis, so they were removed.

fullsetDF = fullsetDF.drop("incident_id", axis = 1)
fullsetDF = fullsetDF.drop("incident_url_fields_missing", axis = 1)

Processing

We have split the date into month and year fields, since this information will be used later when studying changes over the years and checking for any seasonal patterns.

def splitDates(datesList):
    months = []
    years = []
    for i in range(0, len(datesList)):
        value = datesList[i]
        splitDate = dt.datetime.strptime(value, "%Y-%m-%d")
        monthName = splitDate.strftime('%b')
        months.append(monthName)
        years.append(splitDate.year)
    fullsetDF.insert(1, "MONTH", months)
    fullsetDF.insert(2, "YEAR", years)

allDates = fullsetDF['date']
splitDates(allDates)
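The same split can also be expressed with pandas' vectorized datetime accessors; a minimal sketch on a toy frame (the column names mirror those used above):

```python
import pandas as pd

# Toy frame standing in for fullsetDF; only the "date" column matters here.
df = pd.DataFrame({"date": ["2013-01-05", "2014-07-21"]})

# Parse once, then derive the month abbreviation and year from the accessor.
parsed = pd.to_datetime(df["date"], format="%Y-%m-%d")
df.insert(1, "MONTH", parsed.dt.strftime("%b"))
df.insert(2, "YEAR", parsed.dt.year)
```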

We would like to see whether the type of venue has any impact on the number of incidents that occur. However, there are no clearly defined venue types, so we had to define them ourselves and determine which incidents fall into each category. The location can be mentioned in several fields, so we search all of them: location_description, incident_characteristics, notes and sources. A new field called LocationType was added.

locationType = ["Bar/Club", "College", "School", "Apartment", "Park", "Restaurant", "Courthouse", "Tavern", "Theater"]
locations = fullsetDF["location_description"]
characteristics = fullsetDF["incident_characteristics"]
notes = fullsetDF["notes"]
sources = fullsetDF["sources"]
fieldNo = fullsetDF.columns.get_loc("location_description")

incidentLocationIdentified = []
for i in range (0, len(fullsetDF.index)):
    valueArray = []
    valueArray.append(locations[i])
    valueArray.append(characteristics[i])
    valueArray.append(notes[i])
    valueArray.append(sources[i])
    found = ""
    for j in range(0,4):
        value = valueArray[j]
        if type(value) == str:
            for k in range (0, len(locationType)):
                if locationType[k].lower() in value.lower():
                    found = locationType[k]
                    break
    if(found == ""):
        incidentLocationIdentified.append("N/A")
    else:
        incidentLocationIdentified.append(found)

fullsetDF.insert(fieldNo, "LocationType", incidentLocationIdentified)
fullsetDF = fullsetDF.drop("location_description", axis = 1)
fullsetDF = fullsetDF.drop("notes", axis = 1)
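The keyword search above can be expressed more compactly; a sketch on a hypothetical single text column (the real search spans four columns):

```python
import pandas as pd

# Hypothetical mini-frame; in the real data four text fields are searched.
df = pd.DataFrame({"notes": ["Shots fired near the school", None,
                             "Fight at the Bar/Club downtown"]})
venues = ["Bar/Club", "School"]

def classify(text):
    """Return the first venue keyword found in the text, else 'N/A'."""
    if not isinstance(text, str):
        return "N/A"  # missing values (NaN/None) stay unclassified
    lowered = text.lower()
    for venue in venues:
        if venue.lower() in lowered:
            return venue
    return "N/A"

df["LocationType"] = df["notes"].apply(classify)
```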

The field incident_characteristics can be used to identify types of incidents. As described previously, we have defined certain categories and searched the text to determine which categories each incident falls into.

Unlike the previous example, where only one venue was possible, here one or more characteristics can describe each incident, which required us to define separate fields with a YES/NO response (binary categorical variables).

incidentType = ["Home Invasion", "Mass Shooting", "Officer Involved", "Armed Robbery", "Drive-by", "Domestic Violence","Gang"]
fieldNo = fullsetDF.columns.get_loc("incident_characteristics")
matrix = []

for i in range(0, len(incidentType)):
    incidentTypeDiscovered = []
    for j in range(0, len(characteristics)):
        value = characteristics[j]
        if type(value) != float:
          if incidentType[i].lower() in value.lower():
            incidentTypeDiscovered.append("YES")
          else:
            incidentTypeDiscovered.append("NO")
        else:
          incidentTypeDiscovered.append("UNKNOWN")
    fieldName = incidentType[i]
    fullsetDF.insert(fieldNo, fieldName, incidentTypeDiscovered)

The field participant_gender offers information about all the people involved. We would like to extract the actual counts and add these fields to the dataframe.

genderParticipants = fullsetDF["participant_gender"]
female = []
male = []
for i in range (0, len(genderParticipants)):
    value = genderParticipants[i]
    if type(value) != float:
        no = value.count("Female")
        female.append(no)
        no = value.count("Male")
        male.append(no)
    else:
        female.append(0)
        male.append(0)

fieldNo = fullsetDF.columns.get_loc("participant_gender")
fullsetDF.insert(fieldNo, "FEMALES", female)
fullsetDF.insert(fieldNo, "MALES", male)

We would also like to make the distinction between the number of adults, teenagers and children involved in the incidents and we will use the same process as above on the field participant_age_group.

ageParticipants = fullsetDF["participant_age_group"]
children = []
teenagers = []
adults = []
for i in range (0, len(ageParticipants)):
    value = ageParticipants[i]
    if type(value) != float:
        no = value.count("Adult 18+")
        adults.append(no)
        no = value.count("Teen 12-17")
        teenagers.append(no)
        no = value.count("Child 0-11")
        children.append(no)
    else:
        adults.append(0)
        teenagers.append(0)
        children.append(0)
fieldNo = fullsetDF.columns.get_loc("participant_age_group")
fullsetDF.insert(fieldNo, "ADULTS", adults)
fullsetDF.insert(fieldNo, "TEENAGERS", teenagers)
fullsetDF.insert(fieldNo, "CHILDREN", children)

The cleaned dataframe was saved to a new CSV file.

fullsetDF.to_csv("gun-violence_processed-data.csv")

Graphs

Initial Data Exploration

We will start by using some simple plots to get a first view of the data analyzed.

Chart Type: Barplot with one numeric variable

Calculating total numbers of

  • incidents that occurred
  • cases that resulted in killings
  • cases that resulted in injuries

in each state, regardless of month or year.

totalPerState = []
totalKillingsPerState = []
totalInjuriesPerState = []

for i in range (0, len(stateCodesDF)):
    state = stateCodesDF["State"][i]

    dfHelper = fullsetDF[fullsetDF["state"]==state]['state'].copy(deep=True)
    totalPerState.append(dfHelper.count().item())

    dfHelper = fullsetDF[(fullsetDF["state"] == state) & (fullsetDF['n_killed'] > 0)]['state'].copy(deep=True)
    totalKillingsPerState.append(dfHelper.count().item())

    dfHelper = fullsetDF[(fullsetDF["state"] == state) & (fullsetDF['n_injured'] > 0)]['state'].copy(deep=True)
    totalInjuriesPerState.append(dfHelper.count().item())

stateCodes = stateCodesDF["Code"]
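The same per-state totals can be computed without an explicit loop using groupby; a sketch with toy data:

```python
import pandas as pd

# Toy incidents frame; the real one is fullsetDF.
df = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Texas"],
    "n_killed": [0, 2, 1],
    "n_injured": [1, 0, 0],
})

# One row per state: total incidents, plus counts of incidents with
# at least one death / at least one injury.
summary = df.groupby("state").agg(
    incidents=("state", "size"),
    with_kills=("n_killed", lambda s: int((s > 0).sum())),
    with_injuries=("n_injured", lambda s: int((s > 0).sum())),
)
```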

We have defined a function for this type of plot and called it for each of the scenarios. A barplot uses a single numeric variable and a categoric variable.

initial_eda_barplots <- function(all_labels, all_values, x_name, y_name, plot_title, colour){
  
  data=data.frame(name = all_labels, value = all_values)

  totalIncidentsBarPlot <- ggplot(data, aes(x=name, y = value)) + geom_bar(stat = "identity", width=0.7, fill = colour) + 
    xlab(x_name) + 
    ylab(y_name) +
    ggtitle(plot_title) +
    theme(
      plot.title = element_text(hjust = 0.5),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "white"), 
      axis.text.x = element_text(angle = 90, vjust = 0.5)
  )
  ggplotly(totalIncidentsBarPlot)
}

We first want to see the distribution of the total number of incidents for each state. We will use the totalPerState list and the codes from the stateCodes array.

Numeric Variable: Total number of incidents that occurred

Categoric Variable: State codes

r_totalPerState <- py$totalPerState
r_labels <- py$stateCodes

initial_eda_barplots(r_labels, r_totalPerState, "State Code", "Number of incidents", "Distribution of incidents per State", "#091CC1")

The general level of incidents roughly falls within the interval [0, 5000], but we do see some outliers: states with a significantly higher number of gun-involved cases.

Numeric Variable: Total number of cases that resulted in deaths

Categoric Variable: State codes

r_totalKillingsPerState <- py$totalKillingsPerState

initial_eda_barplots(r_labels, r_totalKillingsPerState, "State Code", "Number of killings", "Distribution of cases that resulted in killings per State", "#652194")

The number of deaths for most states lies between 0 and 3000, with the overwhelming majority being between 0 and 2000. Outlier values go up to 5000.

Numeric Variable: Total number of cases that resulted in injuries

Categoric Variable: State codes

r_totalInjuriesPerState <- py$totalInjuriesPerState

initial_eda_barplots(r_labels, r_totalInjuriesPerState, "State Code", "Number of injuries", "Distribution of cases that resulted in injuries per State", "#134611")

The number of injuries per state is between 0 and 6000 for all states but one, the exception being over 9000.

Outlier values can be noticed in the last two charts as well. Given this observation, it makes sense to produce another chart, a boxplot, that specifically highlights this information. Even though boxplots can sometimes be misleading because of the loss of information, our purpose here is to identify the states that fall outside the IQR (interquartile range), which makes this type of chart appropriate.
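The boxplot's outlier rule can also be applied numerically; a sketch with illustrative values:

```python
import numpy as np

# Values beyond Q3 + 1.5*IQR are the points a boxplot draws as outliers.
values = np.array([100, 200, 250, 300, 320, 5000])
q1, q3 = np.percentile(values, [25, 75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = values[values > upper_fence]
```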

Chart Type: Boxplot with three series

A boxplot uses one or more numeric variables and since we would like to make a comparison between the three series, we will plot them all together.

Three Numeric Variables: Total number of incidents for each state Total number of firearm deaths for each state Total number of firearm injuries for each state

r_totalPerState <- py$totalPerState
r_totalKillingsPerState <- py$totalKillingsPerState
r_totalInjuriesPerState <- py$totalInjuriesPerState

p <- boxplot(r_totalPerState, r_totalKillingsPerState, r_totalInjuriesPerState,
main = "Outlier variables - States that have higher criminality",
names = c("Number of incidents", "Number of killings", "Number of injuries"),
col = c("#091CC1","#652194", "#134611")
)

outliers <- p$out

From the boxplots, the distributions of all three series (number of incidents, killings and injuries) are skewed towards lower values, with the median sitting closer to the bottom of the chart. The outlier values can also be clearly seen, lying outside the range of the whiskers.

Extracting the outlier values from the boxplot representation:

outliers = r.outliers
actualStates = []

for i in range(0,4):
  for j in range (0, len(stateCodes)):
    if(totalPerState[j] == outliers[i]):
      actualStates.append(stateCodes[j])
      break

for i in range(4,6):
  for j in range (0, len(stateCodes)):
    if(totalKillingsPerState[j] == outliers[i]):
      actualStates.append(stateCodes[j])
      break

for i in range(6,7):
  for j in range (0, len(stateCodes)):
    if(totalInjuriesPerState[j] == outliers[i]):
      actualStates.append(stateCodes[j])
      break

actualStates = list(dict.fromkeys(actualStates))
outlierList = []
for i in range(0, len(actualStates)):
  outlierList.append(stateCodesDF[stateCodesDF["Code"] == actualStates[i]]["State"].item())
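The hard-coded index ranges above depend on the order in which the boxplot reports its outliers; a sketch of a direct lookup that avoids them (the totals are illustrative):

```python
# Map each outlier value back to its state code by direct lookup.
totals_by_code = {"CA": 17000, "TX": 13000, "OH": 2500}
outlier_values = {17000, 13000}
outlier_codes = [code for code, n in totals_by_code.items()
                 if n in outlier_values]
```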

Chart Type: Doughnut Plot

We have decided to further investigate the outlier states so as to better understand the factors that led to an increased number of incidents involving firearms.

The first type of analysis targets the age categories. For this purpose, the data was visualized in separate doughnut charts as seen below.

Calculating the percentage of Child, Teenager and Adult participants for each state.

outlierList = np.array(outlierList)
totalsStates = np.zeros(shape=(len(outlierList),3), dtype = float)
categories = np.array(["CHILDREN", "TEENAGERS", "ADULTS"])

for i in range(0,len(outlierList)):
    state = outlierList[i]
    array = []

    dfHelper = fullsetDF[fullsetDF["state"]==state][categories[0]].copy(deep=True)
    totalsStates[i][0] = dfHelper.sum().item()

    dfHelper = fullsetDF[fullsetDF["state"]==state][categories[1]].copy(deep=True)
    totalsStates[i][1] = dfHelper.sum().item()

    dfHelper = fullsetDF[fullsetDF["state"]==state][categories[2]].copy(deep=True)
    totalsStates[i][2] = dfHelper.sum().item()

# Percentages per state for each group
for i in range(0, len(outlierList)):
    total = totalsStates[i][0] + totalsStates[i][1] + totalsStates[i][2]
    for j in range(0, 3):
        totalsStates[i][j] = float("{:.4f}".format(totalsStates[i][j] / total))*100
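The percentage loop amounts to row-wise normalization, which numpy can do in one step; a sketch with toy counts:

```python
import numpy as np

# Each row is one state's [children, teenagers, adults] counts;
# dividing by the row sum and scaling gives the percentages.
totals = np.array([[10.0, 30.0, 60.0],
                   [5.0, 5.0, 10.0]])
percent = totals / totals.sum(axis=1, keepdims=True) * 100
```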
donut_plot_labels <- function(all_labels, all_values, title) {
  
data = data.frame(name = all_labels, value = all_values)

data$ymax = cumsum(all_values)
data$ymin = c(0, head(data$ymax, n=-1))

data$labelPosition <- (data$ymax + data$ymin) / 2
data$label <- paste0(data$value, "%")

ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=name)) + 
  geom_rect() +
  geom_text(x=2, aes(y=labelPosition, label=label, color=name), size=3) +
  scale_fill_brewer(palette = "Dark2") +
  scale_color_brewer(palette = "Dark2") +
  coord_polar(theta="y") +
  xlim(c(-1, 4)) +
  ggtitle(title) +
  theme(
    axis.title.x = element_blank(),
    axis.title.y = element_blank())
}

Numeric Variables: Percentage of each age group involved in an incident in one state

r_categories <- py$categories
r_outlierList <- py$outlierList

matrix <- py$totalsStates
for(row in 1 : nrow(matrix)){
  r_totalsStates<- matrix[row,]
  print(donut_plot_labels(r_categories, r_totalsStates, r_outlierList[row]))
}

In all four graphs we can see that the majority of people involved in gun violence are adults, followed by teenagers and then children. This might be an expected result; however, plotting this information revealed differences in the age-group percentages across the states.

For this reason, we will also design an overview of the age groups across the outlier states to gain a better understanding of the aspect we just noticed.

# Percentages per group for each state
for j in range(0, 3):
    total = 0
    for i in range(0, len(outlierList)):
        total = total + totalsStates[i][j]

    for i in range(0, len(outlierList)):
        totalsStates[i][j] = float("{:.4f}".format(totalsStates[i][j] / total))*100
totalsStates = totalsStates.transpose()

Numeric Variables: Percentage per state for one age group

matrix <- py$totalsStates
for(row in 1 : nrow(matrix)){
  r_totalsStates<- matrix[row,]
  print(donut_plot_labels(r_outlierList, r_totalsStates, r_categories[row]))
}

For the Adults category, the differences between the states are generally not that significant, the values for each state being somewhere around 25-40%.

The results for the other two categories have, however, glaring differences. The percentage of children involved is higher in the state of Texas and Florida than in the other two states, while the percentage of Teenagers is higher in Illinois and Florida.

Chart Type: Circular BarPlot

Circular barplots are a variation of the barplot; they use one numerical variable and one categorical variable.

The purpose of the following section is to identify the number of guns used in all the incidents per state. We have noticed that there are several incidents with at least 3 guns involved.

gunsTotalPerState = []
for i in range (0, len(stateCodesDF)):
    state = stateCodesDF["State"][i]
    dfHelper = fullsetDF[fullsetDF["state"]==state]["n_guns_involved"].copy(deep=True)
    gunsTotalPerState.append(dfHelper.sum().item())
gunsTotalPerState = np.array(gunsTotalPerState)
circular_barplot <- function(all_labels, all_values, x_name, y_name, plot_title, colour){
  
  data=data.frame(name = all_labels, value = all_values)

  totalIncidentsBarPlot <- ggplot(data, aes(x=name, y = value)) + geom_bar(stat = "identity", width=0.7, fill = colour) + 
    xlab(x_name) + 
    ylab(y_name) +
    ggtitle(plot_title)+
    theme(
      plot.title = element_text(hjust = 0.5),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "white")
  ) +
  coord_polar(start = 0)
  
  options(repr.plot.width = 40, repr.plot.height = 40)
  totalIncidentsBarPlot
}

Numeric Variable: Total number of guns

Categoric Variable: State codes

r_gunsTotalPerState <- py$gunsTotalPerState

circular_barplot(r_labels, r_gunsTotalPerState, "State Code", "Number of guns", "Distribution of no of guns involved per State", "#091CC1")

We can clearly see that there are several states where the number of guns is a lot higher, specifically California, Florida, Texas and Illinois. As we remember from the initial plots, these are the exact states that had the highest number of incidents as well.

Further Analysis

Chart Type: Heatmap

The variables used for a heatmap are two categoric ones.

We have previously split the incident type into 7 different categories. We would now like to make an analysis of the causes of gun-violence, i.e. how often each type occurs for the individual states.

typeTotalPerState = np.empty(shape=(len(incidentType)*len(stateCodesDF)))
it = 0
for i in range(0, len(incidentType)):
    incident = incidentType[i]

    for j in range (0, len(stateCodesDF)):
        state = stateCodesDF["State"][j]
        dfHelper = fullsetDF[(fullsetDF["state"]==state) & (fullsetDF[incident] == "YES")][incident].copy(deep=True)

        typeTotalPerState[it] = dfHelper.count()
        it = it + 1

# Note: statesAll and insertRow are defined elsewhere in the notebook.
def dataCleaning(dataframe):
  allYears = [2013, 2014, 2015, 2016, 2017, 2018]
  for i in range(0, len(statesAll)):
      for j in range(i*6, i*6+6):
          year = allYears[j - i*6]
          if (dataframe.iloc[j]["YEAR"] != year):
              insertValue = [statesAll[i], year, 0]
              insertRow(j, dataframe, insertValue)
              dataframe = dataframe.sort_index(axis=0)
              j = j-1

Categorical variables: Incident Type and the States Codes

r_typeTotalPerState <- py$typeTotalPerState
x <- py$stateCodes
y <- py$incidentType
data <- expand.grid(X=x, Y=y)

data$Z <- r_typeTotalPerState

ggplot(data, aes(X, Y, fill=Z)) + 
 geom_tile() + 
  scale_fill_distiller(palette = "Greens") + 
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5)
  )

The most common type of incident is a mass shooting, followed by gang fights, while the other types (such as domestic violence, home invasion etc.) are less common.

Chart Type: Scatter Plot

The next analysis targets the number of females and males involved in incidents for each state, as we want to see if this might also be considered a factor. Statistics generally show that women commit fewer crimes than men. However, this is a very general statement, and we want to see whether it also holds for gun-involved incidents in our time frame.

A scatter plot uses two unordered numeric variables.

malefemaleDistribution = fullsetDF[["state", "FEMALES", "MALES"]]
malefemaleDistribution = malefemaleDistribution.groupby([malefemaleDistribution["state"]], as_index=False).agg({"FEMALES" : "sum", "MALES" : "sum"})
malefemaleDistribution = malefemaleDistribution[["FEMALES", "MALES"]]

malefemaleDistribution.index = statesAll
simpleScatter <- function(data){
  ggplot(data, aes(x=FEMALES, y=MALES)) + 
    geom_point() +
    geom_text(label=rownames(data), check_overlap = T)
}

Numeric Variables: Total number of females involved, Total number of males involved

r_data <- py$malefemaleDistribution

simpleScatter(r_data)

The distribution between the two is fairly even. We do see some states with a higher number of males involved (Illinois, California, Arkansas), but we also see states with a higher number of females involved (Florida, Texas, Georgia).

Chart Type: WordCloud

A wordcloud uses a numeric and a categoric variable. It takes a list of words, each with a frequency that is expressed in the graph through the size of the displayed word.

Lastly, we want to see whether there are locations where the level of crime is higher than in others. The dataset contains many unknown locations; however, to give an idea of where most incidents occurred, we have plotted a word cloud as seen below.

column_names = ["word", "freq"]


locationTypeTotal = pd.DataFrame(columns = column_names)

for i in range (0, len(locationType)):
    array = []
    location = locationType[i]
    dfHelper = fullsetDF[fullsetDF["LocationType"]==location]["LocationType"].copy(deep=True)
    array.append(location)
    array.append(dfHelper.count().item())

    locationTypeTotal.loc[-1] = array
    locationTypeTotal.index = locationTypeTotal.index + 1
    locationTypeTotal = locationTypeTotal.sort_index()

for i in range(0, len(locationTypeTotal)):
    locationTypeTotal = locationTypeTotal.rename(index={i: locationTypeTotal["word"][i]})
    
words = np.array(locationTypeTotal["word"])
freq = np.array(locationTypeTotal["freq"], dtype=int)
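The word/frequency table can also be built directly with value_counts; a sketch on toy labels:

```python
import pandas as pd

# Toy LocationType column; "N/A" rows are excluded from the cloud.
s = pd.Series(["Park", "School", "Park", "N/A", "Bar/Club"])
freq = s[s != "N/A"].value_counts()
wordcloud_df = freq.rename_axis("word").reset_index(name="freq")
```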
wordGraph <- function(dataFreq) {
  wordcloud2(data=dataFreq, size=1)
}

Numeric Variable: The total number of incidents recorded at each known venue

Categoric Variable: The list of venues

r_words <- py$words
r_freq <- py$freq
r_location <- data.frame(word = r_words, freq = r_freq)

wordGraph(r_location)

As one would expect, the places with the highest rate of gun violence are parks, apartments (i.e. at home), schools and bars.

Chart Type: Doughnut Chart

Calculating the total number of incidents per state and adding them to a dictionary. States with incident percentages below 1% are grouped under ‘Others’.

totalPerState = []
incidentsStatesArray = np.array(fullsetDF['state'].to_numpy())
for i in range(0, len(stateCodesDF)):
    state = stateCodesDF["State"][i]
    totalPerState.append(incidentsStatesArray[incidentsStatesArray == state].shape[0])
        
DictStateIncident = {"Others": 0}
# Use the size of the full dataset as the denominator, and scale to percent
totalIncidents = len(fullsetDF)

incidentsPerStatePercentages = (np.array(totalPerState) / totalIncidents) * 100
for i in range(0,len(stateCodes)):
    if incidentsPerStatePercentages[i] > 1:
        DictStateIncident[stateCodes[i]] = round(incidentsPerStatePercentages[i], 2)
    else:
        DictStateIncident["Others"] = DictStateIncident["Others"] + round(incidentsPerStatePercentages[i], 2)
        
statesForDonut = np.array(list(DictStateIncident.keys()))
valuesForDonut = np.array(list(DictStateIncident.values()))
donut_plot <- function(all_labels, all_values) {
  
data = data.frame(name = all_labels, value = all_values)

data$ymax = cumsum(all_values)
data$ymin = c(0, head(data$ymax, n=-1))

ggplot(data, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=name)) + 
  geom_rect() +
  coord_polar(theta="y") +
  xlim(c(2, 4)) +
  scale_size(range=c(2,12), guide="none") +
  theme(
    panel.grid.major.x = element_blank(),
    panel.border = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(size = 12)
  ) +
  xlab("") +
  ylab("Number of incidents reported per each state")

}
r_labels <- py$statesForDonut
r_percentagesPerState <- py$valuesForDonut

donut_plot(r_labels, r_percentagesPerState)

Calculating the number of incidents per geographical region: North-East (NE), North-West(NW), South-West(SW), South-East(SE).

We consider the center position the geographic center of the United States. This is a point approximately 32 km north of Belle Fourche, South Dakota at ‘LAT. 39°50’ LONG. −98°35’.

Variables: latitude, longitude

DictIncidentsPerRegion = {"NE": 0, "NW": 0, "SW": 0, "SE": 0}

incidentsCoordinates = fullsetDF[["latitude", "longitude"]]

for i in range(0, len(incidentsCoordinates)):
    if incidentsCoordinates["latitude"][i] > 39 and incidentsCoordinates["longitude"][i] > -98:
        DictIncidentsPerRegion["NE"] += 1
    elif incidentsCoordinates["latitude"][i] > 39 and incidentsCoordinates["longitude"][i] < -98:
        DictIncidentsPerRegion["NW"] += 1
    elif incidentsCoordinates["latitude"][i] < 39 and incidentsCoordinates["longitude"][i] < -98:
        DictIncidentsPerRegion["SW"] += 1
    elif incidentsCoordinates["latitude"][i] < 39 and incidentsCoordinates["longitude"][i] > -98:
        DictIncidentsPerRegion["SE"] += 1

totalNoIncidents = sum(DictIncidentsPerRegion.values())
  
regions = np.array(list(DictIncidentsPerRegion.keys()))
incidentsPerRegions = np.round(np.array((np.array(list(DictIncidentsPerRegion.values())) / totalNoIncidents) * 100), 2)
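The quadrant classification above can also be done with vectorized comparisons instead of a row loop; a sketch on toy coordinates:

```python
import numpy as np
import pandas as pd

# Classify points into quadrants relative to the geographic center (39N, -98E).
coords = pd.DataFrame({"latitude": [41.8, 34.0, 25.8],
                       "longitude": [-87.6, -118.2, -80.2]})
north = coords["latitude"] > 39
east = coords["longitude"] > -98
region = np.where(north & east, "NE",
         np.where(north & ~east, "NW",
         np.where(~north & ~east, "SW", "SE")))
```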
r_regions <- py$regions
r_incidentsPerRegion <- py$incidentsPerRegions

donut_plot_labels(r_regions, r_incidentsPerRegion, "Incidents per region in %")

Observations: Most gun-involved crimes happen in the eastern half of the country, accounting for almost 83% of the total. Comparing the north of the country with the south, more incidents happened in the latter.

Chart Type: Lollipop chart with one numeric variable

In this section we want to highlight the top 25 and the bottom 25 news websites that reported crimes, which would help decide which online newspapers are the most and least trustworthy.

Variables: sources

Method: We count the number of occurrences of a specific source through all the reported incidents. At the end, we sort by this number and plot a graph with the names of the online sources and the number of news items each one reported over the period.

sourcesPerIncident = fullsetDF[["date", "sources"]]

DictSourcesAndIncidents = {}
for i in range (0, len(sourcesPerIncident["sources"])):
    try:
        sourcesArray = sourcesPerIncident["sources"][i].split("||")
    except AttributeError:
        # skip incidents with a missing (NaN) sources field
        continue
    for source in sourcesArray:
        URI = source.strip().partition("//")[2]
        URL = URI.partition("/")[0]
        if URL in DictSourcesAndIncidents:
            DictSourcesAndIncidents[URL] += 1
        else:
            # the first occurrence counts as one report, not zero
            DictSourcesAndIncidents[URL] = 1

lessIncidentsPerSource = dict(filter(lambda elem: elem[1] != 0, DictSourcesAndIncidents.items()))

sort_by_value = lambda DictSourcesAndIncidents: DictSourcesAndIncidents[1]
sortedSourcesDict = sorted(DictSourcesAndIncidents.items(), key = sort_by_value, reverse=True)

sourcesName = []
sourceNews = []
for x in list(sortedSourcesDict)[0:25]:
    sourcesName.append(x[0])
    sourceNews.append(x[1])
    
sortedSourcesLeastTrustful = sorted(lessIncidentsPerSource.items(), key = sort_by_value)

sourcesLeastTrustful = []
appearancesLeastTrusful = []
for x in list(sortedSourcesLeastTrustful)[0:25]:
    sourcesLeastTrustful.append(x[0])
    appearancesLeastTrusful.append(x[1])
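The hostname extraction by string partitioning can also be done with the standard library's urlparse; a sketch on illustrative source strings:

```python
from urllib.parse import urlparse

# Each incident lists its sources separated by "||"; keep only the hostname.
sources = "http://pittsburgh.cbslocal.com/a||http://chicago.suntimes.com/b"
hosts = [urlparse(s.strip()).netloc for s in sources.split("||")]
```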

For this information, we chose to display a lollipop graph, which is very similar to a barplot, and we added a hover over the graph to see details for each record.

library(plotly)

lollipop_function <- function(sources, appearances, title) {
    
    data <- data.frame(x = sources, y = appearances)
    
    p <- ggplot(data, aes(x=x, y=y)) +
      geom_segment( aes(x=reorder(x, y), xend=x, y=0, yend=y), color="grey") +
      geom_point( color="orange", size=4, fill=alpha("orange", 0.3), alpha=0.7, shape=21, stroke=2) +
      theme_light() +
      coord_flip() +
      ggtitle(title) +
      scale_size(range=c(2,12), guide="none") +
      theme(
        panel.grid.major.x = element_blank(),
        panel.border = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(size = 12)
      ) +
      xlab("") +
      ylab("Number of incidents reported for the whole period")
    
    ggplotly(p)
}
r_source <- py$sourcesName
r_news <- py$sourceNews

lollipop_function(r_source, r_news, "The most trustworthy websites that reported gun-violence incidents")

From the above plot, we can observe that the Chicago Sun-Times is by far the platform that investigates and reports the majority of crime-related news, with 7060 articles over the 6 years of this study. Above is the top 25 most trustworthy websites, from which we can also see that Facebook and Twitter are used for this kind of news as well.

r_source <- py$sourcesLeastTrustful
r_news <- py$appearancesLeastTrusful

lollipop_function(r_source, r_news, "The least trustworthy websites that reported gun-violence incidents")

There are many websites which have only one incident of this kind reported over these years.

Chart Type: Choropleth charts

Variables: state

We are computing the number of incidents per state.

In order to plot a choropleth map, we need a geo-json file with information about the borders of each state in terms of latitude/longitude. We then merge the two data sets by an id (in this case the name of the state) and plot the figure below.
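The merge step is an ordinary key join; a sketch in pandas with illustrative counts (the actual join in this report is done in R with merge()):

```python
import pandas as pd

# Geometry ids from the geo-json on one side, per-state counts on the other.
geometry = pd.DataFrame({"id": ["Alabama", "Alaska"]})
counts = pd.DataFrame({"id": ["Alabama", "Alaska"], "values": [5471, 1349]})

# Inner join on the shared "id" column.
merged = geometry.merge(counts, on="id")
```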

choroplethPlot <- function(spdf_fortified, title) {

  ggplot(data = spdf_fortified, aes( x = long, y = lat, group = group)) +
      geom_polygon(aes(fill = values), color = "black", size = 0.1) +
      coord_map() +
      labs(
        title = "US states map",
        subtitle = title
      ) +
      scale_fill_gradient(high = "#e34a33", low = "#fee8c8", guide = "colorbar") +
      theme(
        text = element_text(color = "#22211d"),
        plot.background = element_rect(fill = "#f5f5f2", color = NA),
        panel.background = element_rect(fill = "#f5f5f2", color = NA),
        legend.background = element_rect(fill = "#f5f5f2", color = NA),
  
        plot.title = element_text(size= 14, hjust=0.01, color = "#4e4d47", margin = margin(b = -0.1, t = 0.4, l = 2, unit = "cm")),
        plot.subtitle = element_text(size= 12, hjust=0.01, color = "#4e4d47", margin = margin(b = -0.1, t = 0.43, l = 2, unit = "cm")),
        plot.caption = element_text( size=12, color = "#4e4d47", margin = margin(b = 0.3, r=-99, unit = "cm") ),
  
        legend.position = c(0.9, 0.2)
      ) +
      xlim(c(-130, -65)) +
      ylim(c(20, 50))

}

Counting the number of incidents per state:

incidentsPerState = fullsetDF["state"].value_counts(ascending=False)

statesForMap = np.array(incidentsPerState.index)
incidentsForMap = np.array(incidentsPerState.values)
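The `value_counts` step above can be illustrated on a toy series (hypothetical states, not the real tallies):

```python
import pandas as pd

# Toy stand-in for fullsetDF["state"]; the real frame has one row per incident.
states = pd.Series(["Illinois", "Texas", "Illinois", "Ohio", "Illinois", "Texas"])

# value_counts returns a Series indexed by state, sorted by count (descending here).
counts = states.value_counts(ascending=False)
```

The index of the result becomes `statesForMap` and the values become `incidentsForMap`.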

Read the geo-json file and compute the plot:

library(geojsonio)   # geojson_read()
library(broom)       # tidy() for sp objects

file <- system.file("examples", "us_states.json", package = "geojsonio")
spdf <- geojson_read(file, what = "sp")

spdf_fortified <- tidy(spdf, region = "id")
## SpP is invalid
r_states <- py$statesForMap
r_incidents <- py$incidentsForMap

numericData <- data.frame(id = r_states, values = r_incidents)

spdf_fortified <- merge(spdf_fortified, numericData, by="id")

choroplethPlot(spdf_fortified, "Number of gun-violence incidents per state")

From the above plot, there is clear confirmation that the eastern side of the country has more incidents than the western region (as one of the doughnut plots also displayed). This graph, however, shows in a more user-friendly way which states had the highest number of incidents over this period.

killsPerState = fullsetDF[["state", "n_killed"]]

killsPerState = killsPerState.groupby([killsPerState["state"]], as_index=False).agg({"n_killed" : "sum"})
statesList = killsPerState["state"] 
deathsList = killsPerState["n_killed"] 
file <- system.file("examples", "us_states.json", package = "geojsonio")
spdf <- geojson_read(file, what = "sp")

spdf_fortified <- tidy(spdf, region = "id")
## SpP is invalid
r_states_list <- py$statesList
r_kills_per_state <- py$deathsList

numericData <- data.frame(id = r_states_list, values = r_kills_per_state)

spdf_fortified <- merge(spdf_fortified, numericData, by="id")

choroplethPlot(spdf_fortified, "Number of deaths resulting in crimes per state")

The pattern remains the same, which suggests a proportional relationship between the number of incidents and the number of deaths resulting from them.
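One way to quantify such a proportional relationship is the Pearson correlation between the two per-state series; a minimal sketch on toy per-state figures (hypothetical numbers, not the real counts):

```python
import numpy as np

# Toy per-state incident and death counts; in the report these would come from
# incidentsPerState and killsPerState computed above.
incidents = np.array([100, 200, 300, 400])
deaths = np.array([25, 48, 80, 95])

# Pearson correlation coefficient; values near 1 indicate proportional growth.
r = np.corrcoef(incidents, deaths)[0, 1]
```

A correlation close to 1 across states would support the visual impression from the two choropleths.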

Chart Type: Scatter Plot with line predictions and animated plots

Numeric Variables: dates of incidents, number of people killed, number injured, and number of guns involved

The aim was to aggregate the number of deaths, the number of people injured, and the number of guns involved by month over the six years and to compute the linear trend. During our investigation we found discrepancies in the data set: in the first year of the study, the figures were remarkably lower than in the following years. We therefore filtered out that data; the results are shown below.

peopleAggregatedByDate = fullsetDF[["date", "MONTH", "YEAR", "n_killed", "n_injured", "n_guns_involved"]]

peopleAggregatedByDate = peopleAggregatedByDate.groupby([peopleAggregatedByDate["YEAR"], peopleAggregatedByDate["MONTH"] ], as_index=False)\
    .agg({"n_killed" : "sum", "n_injured" : "sum", "n_guns_involved" : "sum", "date": np.max}).sort_values("date", ascending=True)

timeSeriesKilled = peopleAggregatedByDate[["date", "n_killed"]]
timeSeriesInjured = peopleAggregatedByDate[["date", "n_injured"]]
timeSeriesGuns = peopleAggregatedByDate[["date", "n_guns_involved"]]

n_killed = ["n_killed"] * len(timeSeriesKilled)
n_injured = ["n_injured"] * len(timeSeriesInjured)
n_guns = ["n_guns"] * len(timeSeriesGuns)

timeSeriesKilled = pd.DataFrame({
    "date": timeSeriesKilled["date"].to_numpy(),
    "variable": n_killed,
    "value": timeSeriesKilled["n_killed"].to_numpy()
})
timeSeriesKilled["index"] = timeSeriesKilled.index

timeSeriesInjured = pd.DataFrame({
    "date": timeSeriesInjured["date"].to_numpy(),
    "variable": n_injured,
    "value": timeSeriesInjured["n_injured"].to_numpy()
})
timeSeriesInjured["index"] = timeSeriesInjured.index

timeSeriesGuns = pd.DataFrame({
    "date": timeSeriesGuns["date"].to_numpy(),
    "variable": n_guns,
    "value": timeSeriesGuns["n_guns_involved"].to_numpy()
})
timeSeriesGuns["index"] = timeSeriesGuns.index

timeSeries = pd.concat([timeSeriesKilled, timeSeriesInjured, timeSeriesGuns])
timeSeriesDate = timeSeries["date"].to_numpy()
timeSeriesValue = timeSeries["value"].to_numpy()
timeSeriesIndex = timeSeries["index"].to_numpy()
timeSeriesVariable = timeSeries["variable"].to_numpy()
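The manual construction and concatenation of the three labeled frames above can also be expressed, as a sketch, with `pandas.melt` (toy numbers, not the real monthly totals):

```python
import pandas as pd

# Toy stand-in for peopleAggregatedByDate after the groupby/agg step.
wide = pd.DataFrame({
    "date": ["2014-01-31", "2014-02-28"],
    "n_killed": [1100, 980],
    "n_injured": [2100, 1900],
    "n_guns_involved": [1500, 1400],
})

# melt reshapes wide to long in one call: one (date, variable, value) row
# per metric per month, matching the structure of timeSeries above.
long = wide.melt(id_vars="date", var_name="variable", value_name="value")
```

This avoids repeating the DataFrame construction once per metric.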

filteredDataAfterFirstYear = timeSeriesKilled[pd.to_datetime(timeSeriesKilled["date"]) >= pd.to_datetime('2014-01-01')]
filteredDataInjuredAfterFirstYear = timeSeriesInjured[pd.to_datetime(timeSeriesInjured["date"]) >= pd.to_datetime('2014-01-01')]
filteredDataGunsAfterFirstYear = timeSeriesGuns[pd.to_datetime(timeSeriesGuns["date"]) >= pd.to_datetime('2014-01-01')]
r_date = py$filteredDataAfterFirstYear["date"]
r_value = py$filteredDataAfterFirstYear["value"]
data = data.frame(date = r_date, value = r_value)

data$date <- as.POSIXct(data$date, format = "%Y-%m-%d")

ggplot(data, aes(x=date, y=value)) +
  geom_point() +
  ggtitle("Evolution of number of murders in time after 2014") +
  geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +
  theme(
    axis.text.x = element_text(angle = 90)
  )
## `geom_smooth()` using formula 'y ~ x'

r_date = py$filteredDataInjuredAfterFirstYear["date"]
r_value = py$filteredDataInjuredAfterFirstYear["value"]
data = data.frame(date = r_date, value = r_value)

data$date <- as.POSIXct(data$date, format = "%Y-%m-%d")

ggplot(data, aes(x=date, y=value)) +
  geom_point() +
  ggtitle("Evolution of number of injured people in time after 2014") +
  geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +
  theme(
    axis.text.x = element_text(angle = 90)
  )
## `geom_smooth()` using formula 'y ~ x'

r_date = py$filteredDataGunsAfterFirstYear["date"]
r_value = py$filteredDataGunsAfterFirstYear["value"]
data = data.frame(date = r_date, value = r_value)

data$date <- as.POSIXct(data$date, format = "%Y-%m-%d")

ggplot(data, aes(x=date, y=value)) +
  geom_point() +
  ggtitle("Evolution of number of guns involved in crimes over time after 2014") +
  geom_smooth(method=lm , color="red", fill="#69b3a2", se=TRUE) +
  theme(
    axis.text.x = element_text(angle = 90)
  )
## `geom_smooth()` using formula 'y ~ x'
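The trend line drawn by `geom_smooth(method = lm)` is an ordinary least-squares fit; the same slope estimate can be sketched in Python on toy monthly totals (hypothetical numbers with some noise, not the real series):

```python
import numpy as np

# Toy monthly murder totals: a baseline of 1000 rising ~8 per month, plus noise.
months = np.arange(12)  # month index since Jan 2014
killed = 1000 + 8 * months + np.array([3, -5, 2, 0, -4, 6, 1, -2, 5, -3, 2, -1])

# Degree-1 polyfit is OLS on (months, killed); slope > 0 means an upward trend.
slope, intercept = np.polyfit(months, killed, 1)
```

A positive slope over the filtered period is what the three scatter plots above show.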

Conclusions: In all of the above, the trend shows an increase in the number of incidents in the United States between 2014 and 2018.

Chart Type: Animated graphs

The plot below shows, in animated fashion, the increase of the above-mentioned metrics over time. We can also observe that the number of guns involved grew faster in the last years than the other two variables.

names = c("n_killed", "n_injured", "n_guns")
r_dates = py$timeSeriesDate
data = data.frame(x = py$timeSeriesDate, y = py$timeSeriesValue, variable=py$timeSeriesVariable, color=py$timeSeriesVariable)

data$x <- as.POSIXct(data$x, format = "%Y-%m-%d")

options(width = 800, height = 3)
ggplot(data, aes(x = x, y = y, group = variable, color = color)) +
    geom_line(linetype = 1, lwd = 0.5) +
    geom_point() +
    scale_color_viridis(discrete = TRUE) + theme_bw() +
    ggtitle("Evolution in time of number of murders, injured people and used guns") +
    theme(
      plot.title = element_text(hjust = 0.5),
      panel.grid.major = element_blank(),
      panel.grid.minor = element_blank(),
      panel.background = element_rect(fill = "white"), 
      axis.text.x = element_text(angle = 90, size = 7),
      axis.line = element_line(colour = "darkblue", size = 0.7, linetype = "solid"),
      legend.position=c(0.1, 0.8)
    ) +
    transition_reveal(x)

Chart Type: Scatter Plot animated

Numeric Variables: number of males and females involved, number of deaths that occurred, mass-shooting status

In the plot below you can observe the evolution of the number of people involved versus the number of deaths that occurred over the years.

fullsetDF["people_involved"] = fullsetDF["MALES"] + fullsetDF["FEMALES"]
dataForNumberOfPeopleVsKilled = fullsetDF[["YEAR", "n_killed", "people_involved", "Mass Shooting"]]
print(dataForNumberOfPeopleVsKilled.head())
##    YEAR  n_killed  people_involved Mass Shooting
## 0  2013         0                4           YES
## 1  2013         1                1           YES
## 2  2013         1                5            NO
## 3  2013         4                4            NO
## 4  2013         2                4            NO
n_people = dataForNumberOfPeopleVsKilled["people_involved"]
n_killed = dataForNumberOfPeopleVsKilled["n_killed"]
years = dataForNumberOfPeopleVsKilled["YEAR"]
mass_shooting = dataForNumberOfPeopleVsKilled["Mass Shooting"]
r_n_people <- py$n_people
r_n_killed <- py$n_killed
r_years <- as.numeric(py$years)
r_mass_shooting <- py$mass_shooting

data = data.frame(x = r_n_people, y = r_n_killed, color=r_mass_shooting, years=r_years)

ggplot(data, aes(x = x, y = y, color = color)) +
  geom_point(alpha = 0.7, stroke = 0) +
  theme_classic() +
  scale_size(range=c(2,12), guide="none") +
  labs(title = "Number of people involved VS number of people killed for the same incident",
       x = "People involved",
       y = "People killed",
       color = "Mass Shooting") +
  theme(axis.title = element_text(),
        legend.text = element_text(size = 6)) +
  scale_color_brewer(palette = "Set2") +
  transition_time(years) +
  labs(subtitle = "Year: {frame_time}") +
  shadow_wake(wake_length = 0.1) +
  ease_aes('linear')

We can observe the following pattern: a higher number of people involved in an incident resulted in a higher number of deaths. There is also a clear differentiation between incidents where a mass shooting took place and those where one did not.
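The differentiation by mass-shooting status can be checked numerically by comparing group means; a minimal sketch on toy incidents (column names match `dataForNumberOfPeopleVsKilled`, the numbers do not):

```python
import pandas as pd

# Hypothetical incidents; YES rows deliberately involve more people and deaths.
toy = pd.DataFrame({
    "people_involved": [2, 3, 8, 10, 4, 12],
    "n_killed":        [0, 1, 4,  6, 1,  7],
    "Mass Shooting":   ["NO", "NO", "YES", "YES", "NO", "YES"],
})

# Mean deaths per incident, split by mass-shooting status.
means = toy.groupby("Mass Shooting")["n_killed"].mean()
```

On the real data, a markedly higher mean for the "YES" group would confirm the separation visible in the animated scatter plot.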

Overall Conclusions

A couple of conclusions can be drawn from the analysis undertaken, especially in relation to the impact several factors have on the level of gun violence.

There are four states where the criminality rate is significantly higher than in the others, some of them having an incredibly high number of children and teenagers involved. Most of the incidents were mass shootings. More incidents were recorded in the eastern part of the country than in the western.